This project takes NBA scheduling data and answers basic questions, then explores trends and builds visualizations for schedules across a decade using ggplot2 and the plotly package. Finally, a model is created to estimate the number of wins each team has gained or lost due to scheduling variables. The data for this project will be unavailable to you because it comes from a source that doesn’t want to share that information, so the accuracy of the answers will be hard to judge, but I hope the code and analysis can demonstrate my data science skills in R. As a reminder, I am also proficient in SQL, Java, and Python.
Note:
Throughout this document, any season column
represents the year each season started. For example, the 2015-16 season
will be in the dataset as 2015. We may refer to a season by just this
number (e.g. 2015) instead of the full text (e.g. 2015-16).
library(tidyverse)
library(readr)
# Note, you will likely have to change these paths. If your data is in the same folder as this project,
# the paths will likely be fixed for you by deleting ../../Data/schedule_project/ from each string.
#These lines assign the excel spreadsheets to data frames
locations <- read_csv("Data/locations.csv")
schedule <- read_csv("Data/schedule.csv")
schedule_24_partial <- read_csv("Data/schedule_24_partial.csv")
team_game_data <- read_csv("Data/team_game_data.csv")
QUESTION: How many times are the Thunder scheduled to play 4 games in 6 nights in the provided 80-game draft of the 2024-25 season schedule? (Note: clarification, the stretches can overlap, the question is really “How many games are the 4th game played over the past 6 nights?”)
# Here and for all future questions, feel free to add as many code chunks as you like. Do NOT put echo = F though, we'll want to see your code.
#This line creates a new df including only OKC's 2024 season games as well as arranges the games chronologically.(df = data frame)
OKC24 <- schedule_24_partial%>%
filter(team == "OKC")%>%
arrange(gamedate)
#Tracks how many instances of 4 games in 6 days we accumulate running through the df.
tracker <- 0
#This loop counts how many 4-in-6 stretches OKC had in their 80-game draft schedule for the `24 season. For each game i (chronologically), I subtract its date from the date of game i+3 (the 4th game); if that difference is less than or equal to 5, the team played 4 games within 6 days. The threshold is 5 rather than 6 because the first game's date counts as day 0 of the window rather than day 1.
for(i in seq_along(OKC24$gamedate))
{
gap <- as.numeric(OKC24$gamedate[i + 3] - OKC24$gamedate[i])
#Checking is.na first guards against indexing past the end of the vector
if(!is.na(gap) && gap <= 5)
{
tracker <- tracker + 1
}
}
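As a sanity check on the loop above, the same count can be computed without explicit indexing. This is a minimal base-R sketch; the function name `count_4in6` is my own, not from the dataset:

```r
# Count 4-in-6 stretches in a vector of game dates (base R sketch).
count_4in6 <- function(dates) {
  dates <- sort(dates)
  n <- length(dates)
  if (n < 4) return(0)
  # Gap in days between each game and the game three slots later;
  # a gap of <= 5 days means 4 games inside a 6-day window.
  gaps <- as.numeric(dates[4:n] - dates[1:(n - 3)])
  sum(gaps <= 5)
}

count_4in6(as.Date("2024-10-01") + c(0, 1, 2, 3))  # 1 stretch
```

Overlapping windows are counted separately here, matching the clarified wording of the question.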
ANSWER 1:
26 4-in-6 stretches in OKC’s draft schedule.
QUESTION: From 2014-15 to 2023-24, what is the average number of 4-in-6 stretches for a team in a season? Adjust each team/season to per-82 games before taking your final average.
#Arranges the schedule df alphabetically by team, and then chronologically within each team across the `14-`23 seasons
schedule <- schedule%>%
arrange(team , gamedate)%>%
mutate(is_4in6 = 0)
tracker2 <- 0
#Generates the total amount of 4 in 6's in the schedule df using the same logic as before
for(b in seq_along(schedule$gamedate))
{
gap <- as.numeric(schedule$gamedate[b + 3] - schedule$gamedate[b])
#The team check prevents a stretch from spanning two different teams' rows
if(!is.na(gap) && gap <= 5 && schedule$team[b + 3] == schedule$team[b])
{
tracker2 <- tracker2 + 1
schedule$is_4in6[b+3] <- 1
}
}
#These variables help convert the statistic to per-82 games
amount_of_seasons <- last(schedule$season) - schedule$season[1] + 1
amount_of_teams <- as.double(n_distinct(schedule$team))
tracker2per82 <- tracker2 / as.double(length(schedule$gamedate))
answer2 <- tracker2per82 * 82
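The per-82 adjustment used here is just a per-game rate scaled to an 82-game season. A small sketch for clarity (`per82` is a name I made up for illustration):

```r
# Scale a count from a team-season of any length to an 82-game season.
per82 <- function(count, games_played) count / games_played * 82

per82(23, 72)  # 23 stretches over a 72-game season, expressed per 82 games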
ANSWER 2:
25.3 4-in-6 stretches on average.
QUESTION: Which of the 30 NBA teams has had the highest average number of 4-in-6 stretches between 2014-15 and 2023-24? Which team has had the lowest average? Adjust each team/season to per-82 games.
#Used to create a cut line in the new df to trim off the large block of useless trailing NA cells.
cutline <- 82 * amount_of_seasons
#This new df holds each team as an individual column with its gamedate data in the rows. I chose to keep this in df format because I find it easier to view, and traversing the NA cells isn't a problem for my task.
schedule_w_team <- schedule%>%
mutate(row = row_number())%>%
pivot_wider(id_cols = row , names_from = team , values_from = gamedate)%>%
select(-row)%>%
mutate(across(everything(), ~
{
#This step collects all of the non-NA values in the column, places them at the top, and pads the bottom with NAs (the column length minus the number of non-NA values).
non_na <- .x[!is.na(.x)]
c(non_na , rep(NA, length(.x) - length(non_na)))
}
))%>%
slice(1:cutline)
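The NA-shifting step inside that `across()` can be isolated as a small helper. A sketch, with `push_na_down` as my own illustrative name:

```r
# Move all non-NA values to the top of a vector, padding the tail with NAs.
push_na_down <- function(x) {
  non_na <- x[!is.na(x)]
  c(non_na, rep(NA, length(x) - length(non_na)))
}

push_na_down(c(NA, 1, NA, 2))  # 1 2 NA NA
```

The output keeps the original length, which is what lets every team column stay aligned in the wide df.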
#Lowest team average 4 in 6
lowest2 <- 0
lowest1 <- Inf
#This loop goes through each team's column, counts the 4-in-6s across its `14-`23 seasons, and compares that count to the current lowest while keeping track of which team currently holds it.
for(col_name in names(schedule_w_team))
{
for(c in seq_along(schedule_w_team[[col_name]]))
{
gap <- as.numeric(schedule_w_team[[col_name]][c + 3] - schedule_w_team[[col_name]][c])
if(!is.na(gap) && gap <= 5)
{
lowest2 <- lowest2 + 1
}
}
if(lowest2 < lowest1)
{
lowest1 <- lowest2
lowest_team <- col_name
}
lowest2 <- 0
}
#These variables gather the per82 average between the `14-`23 seasons of the lowest team.
games_lowest <- schedule_w_team[[lowest_team]][!is.na(schedule_w_team[[lowest_team]])]
games_lowest <- as.double(length(games_lowest))
lowest_per82 <- lowest1/games_lowest * cutline/amount_of_seasons
#Highest team average 4 in 6
highest <- 0
highest1 <- 0
#This loop goes through each team's column, counts the 4-in-6s across its `14-`23 seasons, and compares that count to the current highest while keeping track of which team currently holds it.
for(col_name2 in names(schedule_w_team))
{
for(d in seq_along(schedule_w_team[[col_name2]]))
{
gap <- as.numeric(schedule_w_team[[col_name2]][d + 3] - schedule_w_team[[col_name2]][d])
if(!is.na(gap) && gap <= 5)
{
highest <- highest + 1
}
}
if(highest > highest1)
{
highest1 <- highest
highest_team <- col_name2
}
highest <- 0
}
#These variables gather the per82 average between the `14-`23 seasons of the highest team.
games_highest <- schedule_w_team[[highest_team]][!is.na(schedule_w_team[[highest_team]])]
games_highest <- as.double(length(games_highest))
highest_per82 <- highest1/games_highest * cutline/amount_of_seasons
#These variables calculate a margin of error for NBA teams' average 4-in-6s across the `14-`23 seasons, which I used for some analysis
MOE_high <- highest_per82-answer2
MOE_low <- answer2 - lowest_per82
ANSWER 3:
QUESTION: Is the difference between most and least from Q3 surprising, or do you expect that size difference is likely to be the result of chance?
ANSWER 4:
The difference between the most and the least from Q3 is not that surprising to me. Even if the NBA aimed to give every team the average number of 4-in-6 games per season (25.3481927), there would be a margin of error caused by chance and outside factors. These factors could be tight scheduling caused by long-distance destinations, or the general time crunch the NBA faces in order to have all teams complete a certain number of games by the All-Star break or their total 82 by the end of the season. The NBA is not perfect when it comes to scheduling, and managing this many teams makes giving each team the same number of 4-in-6s difficult. Either way, the fact that the margin of error over a 10-year average across the whole NBA is +2.6745059 and -3.0408879 occurrences of 4-in-6 games makes each team’s schedule pretty fair in comparison.
QUESTION: What was BKN’s defensive eFG% in the 2023-24 season? What was their defensive eFG% that season in situations where their opponent was on the second night of back-to-back?
#Creates a new variable OffB2B to determine whether the offensive team is on the second game of a back to back and rearranges the rows and data
team_game_data <- team_game_data%>%
mutate(OffB2B = 0)%>%
arrange(off_team , gamedate)%>%
relocate(OffB2B , gamedate)
#This loop fills OffB2B with a 1 if the offensive team is playing the second game of a back-to-back, using the same logic as the 4-in-6 loop but with a 1-day window.
for(a in seq_along(team_game_data$gamedate))
{
gap <- as.numeric(team_game_data$gamedate[a + 1] - team_game_data$gamedate[a])
#The off_team check prevents a back-to-back from spanning two different teams' rows
if(!is.na(gap) && gap == 1 && team_game_data$off_team[a + 1] == team_game_data$off_team[a])
{
team_game_data$OffB2B[a + 1] <- 1
}
}
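The same flag can be built without a loop using lagged differences. A minimal sketch on made-up dates; the variable names here are illustrative, not from the dataset:

```r
# Flag the second game of a back-to-back, assuming dates sorted within team.
dates <- as.Date(c("2024-10-01", "2024-10-02", "2024-10-05"))
teams <- c("OKC", "OKC", "OKC")

gap <- c(NA, as.numeric(diff(dates)))                 # days since the previous row
same_team <- c(FALSE, teams[-1] == teams[-length(teams)])
b2b <- as.integer(!is.na(gap) & gap == 1 & same_team)

b2b  # 0 1 0
```
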
#Creates a new df including the rows below as well as filters only games in the `23 season where BKN is the defensive team
BKN23D <- team_game_data%>%
select(gamedate , off_team, def_team, fgmade, fg3made, fgattempted, OffB2B)%>%
filter(gamedate > as.Date("2023-08-06") & gamedate < as.Date("2024-08-06") & def_team == "BKN")
#Calculates defensive eFG%
eFG <- ((sum(BKN23D$fgmade) + .5 * sum(BKN23D$fg3made))/sum(BKN23D$fgattempted)) * 100
#Creates a new df including only games from BKN23D where the opponent was on the second night of a back to back
BKN23_O_B2B <- filter(BKN23D , OffB2B == 1)
#Calculates the eFG% of all back to backs
eFG_B2B <- ((sum(BKN23_O_B2B$fgmade) + .5 * sum(BKN23_O_B2B$fg3made))/sum(BKN23_O_B2B$fgattempted)) * 100
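For reference, effective field-goal percentage weights made threes by an extra half make: eFG% = (FGM + 0.5 × 3PM) / FGA. A small sketch with made-up box-score numbers:

```r
# eFG% = (FGM + 0.5 * 3PM) / FGA, expressed here as a percentage.
efg_pct <- function(fgmade, fg3made, fgattempted) {
  (fgmade + 0.5 * fg3made) / fgattempted * 100
}

efg_pct(40, 12, 85)  # (40 + 6) / 85 * 100, about 54.1
```
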
ANSWER 5:
QUESTION: Please identify at least 2 trends in scheduling over time. In other words, how are the more recent schedules different from the schedules of the past? Please include a visual (plot or styled table) highlighting or explaining each trend and include a brief written description of your findings.
ANSWER 6:
Trend #1: An increase in repeat opponents in consecutive games ("series") for all teams. Analysis below.
#Creates two new columns: one flagging whether a team played the same opponent in two consecutive games, and one averaging that flag across the 30 teams
schedule <- schedule%>%
mutate(is_same_opp = ifelse(!is.na(lag(opponent)) & season == lag(season) &
opponent == lag(opponent), 1 , 0))%>%
mutate(is_same_opp_per_team = is_same_opp/30)
#Creates the first plot showing total "series" (see definition below) across each season
ggplot(schedule , aes(x = season , y = is_same_opp))+
geom_bar(stat = "identity", fill = "blue", width = .8)+
ggtitle('Amount of regular season "series" played in a season across all teams')+
ylab("")+
xlab("Season")
#Creates the second plot showing average "series" per season per team
ggplot(schedule , aes(x = season , y = is_same_opp_per_team))+
geom_bar(stat = "identity", fill = "darkorange1", width = .8)+
ggtitle('Amount of average regular season "series" played in a season per team')+
ylab("")+
xlab("Season")
series - in a regular-season context this is essentially the same as in MLB, where a team plays the same opponent in consecutive games; it is a fairly rare occurrence in the NBA
Analysis: Starting from the ’14 season, the NBA maintained a pretty steady rate of scheduling the same opponents in consecutive games. That number sat at about 30-50 times per season, or an average of about 1-1.3 times per team. However, in the ’20 season this number rocketed up to about 185, meaning each team averaged about 6.1 series. This can most likely be attributed to the NBA adjusting to Covid-19 restrictions in the US on things such as travel and sporting events; NBA schedule makers likely had teams play more series to ease the amount of travel each team was doing that season. So once a lot of these restrictions were lifted, it makes sense that this number dropped drastically to about 80 in the ’21 season. What was unexpected is the sudden increase during the ’22 season that persisted into the ’23 season, sitting between about 150 and 125, meaning each team was averaging 4-5 series across those two seasons, a strange increase from what had historically been an almost constant rate of 1. I understand that things take time to revert to the mean after a large event like COVID, but as its effects started to diminish the numbers rose back up after falling. And even if this was some kind of second wave, each season is its own individual entity a year apart from the last, meaning the schedulers either saw something they liked about having a large number of series in a season (something that barely existed in basketball before) or did this unknowingly. Given that the NBA is a multi-billion-dollar business, my guess is that they did this on purpose, and I want to know why.
Why this matters: In the past couple of seasons I have noticed an increase in series, and personally I don’t really like them. After losing to the Magic, I don’t want to tune back into my team the next day or two to play the Magic again (no offense, Magic). It doesn’t bother me much, but it could be one of the little things that edges NBA consumer satisfaction a little lower. I know some people may like this, and I would need surveys, t-tests, and chi-square tests to show that the NBA viewer population agrees and that it is in fact lowering consumer satisfaction, but I would definitely not count it out as a factor. And in an industry this big, consumer satisfaction dropping at any scale could mean a loss of millions.
Trend #2: Steady decline in the number of back-to-back games played
#Creates a new column determining whether a team was playing a back to back game
schedule <- mutate(schedule , is_b2b = 0)
#This loop populates the new column with a 0 (not a back-to-back game) or a 1 (second game of a back-to-back).
for(y in seq_along(schedule$gamedate))
{
gap <- as.numeric(schedule$gamedate[y + 1] - schedule$gamedate[y])
#The team check prevents a back-to-back from spanning two different teams' rows
if(!is.na(gap) && gap == 1 && schedule$team[y + 1] == schedule$team[y])
{
schedule$is_b2b[y + 1] <- 1
}
}
#Creates the first plot showing total back to backs played across all teams by season
ggplot(schedule , aes(x = season , y = is_b2b))+
geom_bar(stat = "identity", fill = "forestgreen", width = .7)+
ggtitle('Amount of back to backs played in a season across all teams')+
ylab("")+
xlab("Season")+
scale_y_continuous(
limits = c(0,700),
breaks = seq(0, 700, by = 100)
)
#Creates a new column to make a per team average amount of back to backs played in a season
schedule <- mutate(schedule, is_b2b_per_team = is_b2b/30)
#Creates the second plot showing the amount of average back to backs played in a season per team
ggplot(schedule , aes(x = season , y = is_b2b_per_team))+
geom_bar(stat = "identity", fill = "darkorchid", width = .7)+
ggtitle('Amount of average back to backs played in a season per team')+
ylab("")+
xlab("Season")+
scale_y_continuous(
limits = c(0,25),
breaks = seq(0, 25, by = 10)
)
Analysis: Unlike the previous trend, the number of back-to-backs played each season has steadily dropped from 525 (about 20 per team) since the ’14 season, with an all-time low in the ’19 season at about 300, or 10 per team. That low, however, is most likely due to Covid-19 as well. The graph then spikes back up a little, but settles at around 400 to 425, or 13-14 per team, which is around the level right before the shortened Covid season. This is an interesting but predictable phenomenon, driven by players vying for fewer back-to-backs to aid rest and avoid injuries. The NBA seems to have listened and has opted to give each team’s schedule 13 to 14 back-to-backs, making the seasons feel longer for fans but keeping players happy and healthy.
QUESTION: Please design a plotting tool to help visualize a team’s schedule for a season. The plot should cover the whole season and should help the viewer contextualize and understand a team’s schedule, potentially highlighting periods of excessive travel, dense blocks of games, or other schedule anomalies. If you can, making the plots interactive (for example through the plotly package) is a bonus.
Please use this tool to plot OKC and DEN’s provided 80-game 2024-25 schedules.
library(plotly)
#Rearranges the df chronologically by team and adds a column indicating whether the team in the team column is playing home or away, so the shape mapping in the later ggplot works properly
schedule_24_partial <- schedule_24_partial%>%
mutate(home_new = ifelse(home == 0, "away", "home"))%>%
arrange(team, gamedate)
#Creating the Miles Traveled between games variable
#Joins the stadium location coordinates into the df to align with the current team under the team column
schedule_24_partial <- schedule_24_partial%>%
left_join(locations, by = "team", relationship = "many-to-many")%>%
rename(team_latitude = latitude , team_longitude = longitude)
#Renames the team variable in the locations df to opponent
locations <- rename(locations, opponent = team)
#Joins the same data as the previous join but aligns the coordinates with the corresponding opposing team column
schedule_24_partial <- schedule_24_partial%>%
left_join(locations, by = "opponent", relationship = "many-to-many")%>%
rename(opp_latitude = latitude , opp_longitude = longitude)
#Renames the variable back because it is not good practice to permanently rename a join key
locations <- rename(locations, team = opponent)
#Contains the disthaversine function that estimates mileage between two coordinates
library(geosphere)
#Creates an empty matrix
coord_list2 <- matrix(NA_real_, nrow = nrow(schedule_24_partial), ncol=2)
#Fills the empty matrix with the location of the game's longitude and latitude
for(h in seq_along(schedule_24_partial$season))
{
#fills rows with away team's coordinates if away
if(schedule_24_partial$home[h] == 0)
{
#[h,] fills the row with a length-2 numeric vector, which splits the values across the two columns for me
coord_list2[h,] <- c(schedule_24_partial$opp_longitude[h], schedule_24_partial$opp_latitude[h])
}
#fills rows with home team's coordinates if home
if(schedule_24_partial$home[h] == 1)
{
coord_list2[h,] <- c(schedule_24_partial$team_longitude[h],schedule_24_partial$team_latitude[h])
}
}
#Creates longitude and latitude columns in the schedule_24_partial df and populates them with the row's game location coordinates
schedule_24_partial <- schedule_24_partial%>%
mutate(location_longitude = coord_list2[,1], location_latitude = coord_list2[,2])
#Creates a new column showing the distance traveled from the previous game; for a team's first game of the season, if it is away, the miles traveled from the team's home city are shown instead
schedule_24_partial <- schedule_24_partial %>%
mutate(
prev_lon = lag(location_longitude),
prev_lat = lag(location_latitude),
miles_travelled = ifelse(
team != lag(team), 0,
distHaversine(cbind(location_longitude, location_latitude),
cbind(prev_lon, prev_lat)) * 0.000621371
)
)%>%
#From team's home if first of the season (uses team instead of season since seasons are the same in df)
mutate(miles_travelled = ifelse(team!= lag(team) & home == 0 , distHaversine(cbind(location_longitude, location_latitude),cbind(team_longitude, team_latitude)) * 0.000621371, miles_travelled))%>%
#Also does the same function but only for the first row in the df since it was NA
mutate(miles_travelled = ifelse(is.na(lag(team)), distHaversine(cbind(location_longitude, location_latitude),cbind(team_longitude, team_latitude)) * 0.000621371, miles_travelled))
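geosphere::distHaversine returns meters, which the 0.000621371 factor converts to miles. For intuition, here is a base-R sketch of the same great-circle calculation, assuming the spherical Earth radius of 6378137 m that geosphere uses by default (`haversine_miles` is my own name):

```r
# Great-circle distance between two lon/lat points, in miles.
haversine_miles <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a)) * 0.000621371  # meters -> miles
}

haversine_miles(-97.51, 35.46, -97.51, 35.46)  # same arena, 0 miles
```
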
#Gets rid of unnecessary columns
schedule_24_partial <- select(schedule_24_partial, -team_latitude, -team_longitude, -opp_latitude, -opp_longitude, -timezone.x, -timezone.y)
#Creates the static ggplot as well as the tooltip that will be used in the interactive plot
schedule_visual <- schedule_24_partial%>%
ggplot()+
geom_point(aes(x = gamedate, y = miles_travelled, color = team, shape = home_new,
text = paste("Date: " , gamedate,
"<br>Miles Traveled: " , round(miles_travelled, 1),
"<br>Team: ", team,
"<br>Away or Home: ", home_new,
"<br>Opponent: ", opponent)))+
labs(
x = "Date",
y = "Miles Traveled",
title = "OKC and DEN Interactive Scheduling Tool",
color = "Key",
shape = ""
)
#Turns the static ggplot into an interactive one
ggplotly(schedule_visual, tooltip = "text")
ANSWER 7:
QUESTION: Using your tool, what is the best and worst part of OKC’s 2024-25 draft schedule? Please give your answer as a short brief to members of the front office and coaching staff to set expectations going into the season. You can include context from past schedules.
#This chunk will be to create interactive plots for past OKC schedules.
#Joins the stadium location coordinates into the new df to align with the current team under the team column
schedule <- schedule%>%
left_join(locations, by = "team", relationship = "many-to-many")%>%
rename(team_latitude = latitude , team_longitude = longitude)
#Renames the team variable in the locations df to opponent
locations <- rename(locations, opponent = team)
#Joins the same data as the previous join but aligns the coordinates with the corresponding opposing team column
schedule <- schedule%>%
left_join(locations, by = "opponent", relationship = "many-to-many")%>%
rename(opp_latitude = latitude , opp_longitude = longitude)
#Renames the variable back because it is not good practice to permanently rename a join key
locations <- rename(locations, team = opponent)
#Creates an empty matrix
coord_list3 <- matrix(NA_real_, nrow = nrow(schedule), ncol=2)
#Fills the empty matrix with the location of the game's longitude and latitude
for(j in seq_along(schedule$season))
{
if(schedule$home[j] == 0)
{
coord_list3[j,] <- c(schedule$opp_longitude[j], schedule$opp_latitude[j])
}
if(schedule$home[j] == 1)
{
coord_list3[j,] <- c(schedule$team_longitude[j],schedule$team_latitude[j])
}
}
#Creates longitude and latitude columns in the df and populates them with the row's game location coordinates
schedule <- schedule%>%
mutate(location_longitude = coord_list3[,1], location_latitude = coord_list3[,2])
#Creates a new column to show the distance traveled from the previous game and if the game is away and the first of the season the miles traveled from the team's home will be shown
schedule <- schedule %>%
mutate(
prev_lon = lag(location_longitude),
prev_lat = lag(location_latitude),
miles_travelled = ifelse(
season != lag(season), 0,
distHaversine(cbind(location_longitude, location_latitude),
cbind(prev_lon, prev_lat)) * 0.000621371
)
)%>%
#These use season because there are multiple seasons in the df; when the team column switches, the season switches too, so the check covers both the team change and the season change
mutate(miles_travelled = ifelse(season!= lag(season) & home == 0 , distHaversine(cbind(location_longitude, location_latitude),cbind(team_longitude, team_latitude)) * 0.000621371, miles_travelled))%>%
mutate(miles_travelled = ifelse(is.na(lag(season)), distHaversine(cbind(location_longitude, location_latitude),cbind(team_longitude, team_latitude)) * 0.000621371, miles_travelled))
#Gets rid of unnecessary columns and creates a new column indicating if the game is home or away so the shape mapping will work properly
schedule <- schedule%>%
select(-team_latitude, -team_longitude, -opp_latitude, -opp_longitude, -timezone.x, -timezone.y)%>%
mutate(home_new = ifelse(home == 0, "away", "home"))
#This chunk creates the static plot of OKC's schedule in past seasons. Games before September are mapped to the placeholder year 2001 and games from September onward to the placeholder year 2000, so that an October-to-May scale lines up even though the games later in the calendar year come first. The key shows the season each game corresponds to, and the hover info shows just the month and day, plus the season, so you know which season each game is a part of.
schedule_previous_visual2 <- schedule%>%
filter(team == "OKC")%>%
mutate(gamedate = ifelse(month(gamedate) < 9, make_date(2001, month(gamedate), day(gamedate)), no = make_date(2000, month(gamedate), day(gamedate))))%>%
ggplot()+
geom_point(aes(x = as.Date(gamedate), y = miles_travelled, color = as.character(season),
shape = home_new,
text = paste("Date: " , month(as.Date(gamedate)), "/", day(as.Date(gamedate)),
"<br>Season: ", season,
"<br>Miles Traveled: " , round(miles_travelled, 1),
"<br>Away or Home: ", home_new,
"<br>Opponent: ", opponent)))+
labs(
x = "Date",
y = "Miles Traveled",
title = "OKC past seasons",
color = "Key",
shape=""
)+
#I made the scale for this graph match the scale of the `24 draft season. I know this leaves out some observations in years like the COVID seasons, but I want to make comparisons with schedules that are actually in the same time frame as the `24 draft, so a tighter scale matters more to me visually than including outliers.
scale_x_date(date_breaks = "1 month", date_labels = "%b", limits = as.Date(c("2000-10-10", "2001-04-25")))
#Turns the static ggplot into an interactive one
ggplotly(schedule_previous_visual2, tooltip = "text")
#To isolate a single item instead of removing it, double-click its entry in the key
General Brief: For the first couple of weeks of the season, OKC mostly travels around 500-1500 miles between games with an even mix of home and away games. Then for about a week OKC plays at home, traveling 0 miles. After this period, they go on a sizable road trip for about a month, traveling 70 to 1375 miles between games with a few home games sprinkled in, but still constantly on the road. Around the new year OKC has another week at home with zero travel. After that they go on a short road trip logging low mileage, followed by a couple of consecutive home games. They make a couple of long trips, then end up back home for a few games with short trips in between. Another heavy stretch of travel, with a large percentage of away games, follows going into March. To close out the season OKC has a couple of streaks of consecutive home games separated by a few short and long trips.
The best: The best part about the 2024 season is that a large percentage of games involving travel fall within the bottom two sectors (0 to 500 and 500 to 1000) of miles traveled, and OKC has only 15 games requiring travel of over 1000 miles. Less total distance traveled means the team can get better rest, avoid jet lag, and overall be in better shape when it comes to playing out the season. The same pattern shows up in the 2014, 2015, 2017, 2018, 2021, 2022, and 2023 seasons, which all had only about 13-16 games in the 1000-plus-mile sector (excluding COVID seasons for this observation). The only season unlike this was 2016, when 22 games involving travel were over 1000 miles, a difference that can definitely make an impact on a team.
The worst: Although the `24 season includes a lot of games with zero travel, there is also a month-and-a-half stretch, from the middle of November to the beginning of January, where OKC is constantly traveling between games. This could be detrimental to the team’s physical and mental health: traveling this much takes a toll on the body and mind and separates players from their families for a long time. This stretch also comes early in the season, meaning OKC could be worn down by it, fail to recover over the rest of the season, and possibly drop a decent number of games they should win. The 2016 season shows a similar stretch from the middle of January to the beginning of February, while most of the other seasons have their zero-travel days spread out through the whole year, a balance that organizations definitely favor in their schedules.
ANSWER 8:
QUESTION: Please estimate how many more/fewer regular season wins each team has had due to schedule-related factors from 2019-20 through 2023-24. Your final answer should have one number for each team, representing the total number of wins (not per 82, and not a per-season average). You may consider the on-court strength of the scheduled opponents as well as the impact of travel/schedule density. Please include the teams and estimates for the most helped and most hurt in the answer key.
If you fit a model to help answer this question, please write a paragraph explaining your model, and include a simple model diagnostic (eg a printed summary of a regression, a variable importance plot, etc).
#Creates a new df that includes only the `19-`23 seasons and a select few variables
season19_23 <- schedule%>%
filter(season >= 2019)%>%
select(season, gamedate, team, opponent, home, win, is_b2b, miles_travelled)
#Creates a new variable determining whether a team is travelling in back-to-back games, to help identify travel clumps
season19_23 <- mutate(season19_23, is_travelling_b2b = ifelse((season19_23$season == lag(season)),
ifelse(season19_23$home == lag(home),
ifelse((season19_23$home == 0), 1, 0), 0), 0))
#Makes the first row of the is_travelling_b2b column 0 instead of NA
season19_23$is_travelling_b2b[is.na(season19_23$is_travelling_b2b)] <- 0
#Creates the linear model
model <- lm(win ~ is_travelling_b2b + miles_travelled + is_b2b, data = season19_23)
summary(model)
##
## Call:
## lm(formula = win ~ is_travelling_b2b + miles_travelled + is_b2b,
## data = season19_23)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54917 -0.49517 0.03808 0.48905 0.62393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.492e-01 7.334e-03 74.879 < 2e-16 ***
## is_travelling_b2b -5.759e-02 1.035e-02 -5.564 2.70e-08 ***
## miles_travelled -4.109e-05 8.485e-06 -4.843 1.29e-06 ***
## is_b2b -6.465e-02 1.236e-02 -5.230 1.72e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4982 on 11654 degrees of freedom
## Multiple R-squared: 0.007346, Adjusted R-squared: 0.00709
## F-statistic: 28.75 on 3 and 11654 DF, p-value: < 2.2e-16
#Contains the VIF function
library(car)
#Shows the Variance Inflation Factors
vif(model)
## is_travelling_b2b miles_travelled is_b2b
## 1.015400 1.022766 1.027152
Model explanation: Wins are predicted by whether a team is travelling in back-to-back games (to identify travel clumps), miles traveled (to identify long flights), and whether the game is the second of a back-to-back. All are variables that could create poor play through exhaustion and lead to losses.
Model analysis: I decided to use a linear model to track how much of the variance in wins could be explained by scheduling variables. After experimenting with several factors, such as whether the game was part of a series or a 4-in-6 game, I settled on the best outcome I found, where all the explanatory variables were statistically significant with p-values < .05. The model uses miles traveled, whether a team was travelling back-to-back, and whether the game was a back-to-back game. Unfortunately, I was only able to capture an R^2 value of a little less than 1%, though the model is significant overall, with the F-statistic’s p-value also < .05. I was initially displeased with my model, but after going through numerous options, excluding strength of opponent (because I felt that variable goes beyond strictly scheduling), I made my peace with it. There is room for improvement, but I found three significant variables (and learned a completely new function, distHaversine, to calculate the distance between two coordinates) that affect an NBA team’s performance to a measurable extent. And yes, the model does not explain much variance, but ultimately these factors are not the final word on whether a team wins on a given night; that is decided by points scored and points allowed, along with other game performance statistics. So my conclusions came out both weak and strong: my factors are statistically significant and show no collinearity (VIF < 10 for all), but scheduling explains only a small share of whether a team wins a given game.
#Creates a new column called win_factor and populates it with the model's predicted win probability for each game, computed from the coefficients of the model fit above
season19_23 <- season19_23%>%
mutate(win_factor = coef(model)["(Intercept)"] + is_travelling_b2b * coef(model)["is_travelling_b2b"] + miles_travelled * coef(model)["miles_travelled"] + is_b2b * coef(model)["is_b2b"])
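As an aside, the same column can be produced with `predict()`, which evaluates the full linear predictor for every row and stays correct if variables are later added to or dropped from the model; a sketch assuming the same `model` and `season19_23` objects:

```r
#predict() computes intercept + coefficient * value for every row, matching the manual arithmetic above
season19_23 <- season19_23%>%
mutate(win_factor = predict(model, newdata = season19_23))
```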
#These lines are used to help organize future dfs
amount_of_seasons2 <- last(season19_23$season) - season19_23$season[1] + 1
cutline2 <- amount_of_seasons2 * 82
#This line creates a new df that puts each win_factor in chronological order with its respective team.
season19_23_wins <- season19_23%>%
mutate(row = row_number())%>%
pivot_wider(id_cols = row , names_from = team , values_from = win_factor)%>%
select(-row)%>%
mutate(across(everything(), ~
{
#This anonymous function gathers the column's non-NA values, then pads the rest of the column with NAs (the column's length minus the count of non-NA values), shifting each team's games to the top.
non_na <- .x[!is.na(.x)]
c(non_na , rep(NA, length(.x) - length(non_na)))
}
))%>%
slice(1:cutline2)
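An alternative to compacting the NAs after pivoting is to number each team's games up front and pivot on that index; a sketch under the assumption that season19_23 is already in chronological order within each team:

```r
#Numbering games within each team before pivoting yields the same layout without any NA shuffling
season19_23_wins_alt <- season19_23%>%
group_by(team)%>%
mutate(game_no = row_number())%>%
ungroup()%>%
pivot_wider(id_cols = game_no , names_from = team , values_from = win_factor)%>%
select(-game_no)%>%
slice(1:cutline2)
```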
#Most hurt
wins_lowest <- Inf
wins_lowest2 <- 0
#This loop sums each team's win_factor column, subtracting .5 from every game, and keeps the team with the lowest total. The .5 is subtracted because a game with no outside bias is a 50/50 proposition, so win_factor - .5 measures the win probability gained or lost to the scheduling factors in the model.
for(col_name3 in names(season19_23_wins))
{
for(e in seq_along(season19_23_wins[[col_name3]]))
{
if(!is.na(season19_23_wins[[col_name3]][e]))
{
wins_lowest2 <- wins_lowest2 + season19_23_wins[[col_name3]][e] - .5
}
}
if(wins_lowest2 < wins_lowest)
{
wins_lowest <- wins_lowest2
most_hurt <- col_name3
}
wins_lowest2 <- 0
}
#Most helped
wins_highest <- -Inf
wins_highest2 <- 0
#This loop mirrors the most-hurt loop above but keeps the team with the highest total, identifying the most helped team. Starting wins_highest at -Inf (rather than 0) guarantees most_helped is assigned even if every team's total were negative.
for(col_name4 in names(season19_23_wins))
{
for(f in seq_along(season19_23_wins[[col_name4]]))
{
if(!is.na(season19_23_wins[[col_name4]][f]))
{
wins_highest2 <- wins_highest2 + season19_23_wins[[col_name4]][f] - .5
}
}
if(wins_highest2 > wins_highest)
{
wins_highest <- wins_highest2
most_helped <- col_name4
}
wins_highest2 <- 0
}
#Sums each column of season19_23_wins with .5 subtracted from every value, ignoring NAs, and stores the totals in team_wins, a named vector
team_wins <- sapply(season19_23_wins, function(col) sum(col - .5, na.rm = TRUE))
#Creates a nice df called wins showing each teams wins_gained from the model
wins <- data.frame(team = names(team_wins), wins_gained = team_wins)
#prints wins
wins
## team wins_gained
## ATL ATL 1.18196970
## BKN BKN 0.05442204
## BOS BOS -0.17343132
## CHA CHA 1.01073554
## CHI CHI 1.19807812
## CLE CLE 1.74243218
## DAL DAL 0.28573307
## DEN DEN 0.13571990
## DET DET 1.63174966
## GSW GSW -1.68274724
## HOU HOU -0.13540172
## IND IND 1.79418785
## LAC LAC -1.18413977
## LAL LAL -0.44134162
## MEM MEM 0.07741422
## MIA MIA -1.08144629
## MIL MIL 1.22759918
## MIN MIN -0.65929802
## NOP NOP -0.76703029
## NYK NYK 0.77914784
## OKC OKC 0.17595639
## ORL ORL -0.68255543
## PHI PHI 0.80177007
## PHX PHX -1.10738646
## POR POR -2.62059396
## SAC SAC -2.07083638
## SAS SAS -0.66337345
## TOR TOR 0.69591682
## UTA UTA -0.57936116
## WAS WAS 1.05611052
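As a cross-check on the two loops above, the most hurt and most helped teams can also be read directly off the team_wins vector:

```r
#Vectorized equivalents of the two loops
names(which.min(team_wins))   #most hurt: POR in the table above
names(which.max(team_wins))   #most helped: IND in the table above
```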
#I did the above loops in this chunk for reproducibility instead of just reading off the most hurt and most helped.
ANSWER 9: